Paper Review

https://doi.org/10.1080/01621459.2022.2116331

Summary

  1. Clustering using agglomerative (hierarchical) clustering is common.

  2. Hypothesis testing of the difference in means between such clusters has an inflated Type I error rate.

  3. Performing clustering and hypothesis testing on the same dataset (‘double dipping’) is the cause.

  4. Selective inference can control Type I error rate.

Hypothesis Testing

A staple of modern statistics!

\[H_0: \textrm{Null hypothesis}\]

\[H_A: \textrm{Alternative hypothesis}\]

For example,

\(H_0:\) means of two clusters are indistinguishable (\(\mu_\mathcal{C_1} = \mu_\mathcal{C_2}\))

\(H_A:\) means of two clusters are different (\(\mu_\mathcal{C_1} \ne \mu_\mathcal{C_2}\))

p-value

  1. A statistic \(T\) summarises the data and has a known distribution under \(H_0\), e.g., \(T \sim \chi^2_N\).

  2. Calculate that statistic for the observed data, e.g., \(T_\textrm{obs}\).

Definition of p-value

Assuming the null hypothesis \(H_0\) is true, what is the probability of observing a statistic at least as extreme as \(T_\textrm{obs}\)?

\[p = \mathbb{P}_{H_0}(T \ge T_\textrm{obs})\]
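As a concrete illustration (a sketch assuming SciPy; the statistic value and the \(N = 10\) degrees of freedom are made up), the p-value is just the survival function of the null distribution evaluated at the observed statistic:

```python
from scipy.stats import chi2

# Hypothetical observed statistic under a chi-squared null with N = 10 d.o.f.
T_obs = 18.3

# p = P(T >= T_obs) under H0, i.e. the survival function at T_obs
p = chi2.sf(T_obs, df=10)
print(p)
```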

A “small” p-value suggests that the data are incompatible with the assumption that \(H_0\) is true.

But how small is “small”?

Type I Error Rate, \(\alpha\)

A Type I error is rejecting \(H_0\) when it is actually true.

\[\begin{array}{c|cc} & H_0 \textrm{ is True} & H_0 \textrm{ is False} \\ \hline \textrm{Reject }H_0 & \textbf{Type I error} & \textrm{True positive} \\ \textrm{Fail to reject }H_0 & \textrm{True negative} & \textrm{Type II error} \end{array}\]

In the long run, we can control the Type I error rate at \(\alpha \in [0,1]\) by rejecting \(H_0\) only when \(p < \alpha\).
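A quick simulation (a sketch assuming NumPy and SciPy; the sample sizes, seed, and two-sample t-test are illustrative choices) shows that rejecting whenever \(p < \alpha\) gives a long-run Type I error rate close to \(\alpha\) when \(H_0\) really is true:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_sim = 0.05, 2000

rejections = 0
for _ in range(n_sim):
    a = rng.normal(size=30)  # both samples come from the same
    b = rng.normal(size=30)  # distribution, so H0 is true
    _, p = stats.ttest_ind(a, b)
    rejections += p < alpha

rate = rejections / n_sim
print(rate)  # close to alpha = 0.05
```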

\(\alpha\) is decided by the investigator.

Agglomerative Clustering

A “bottom-up” method of hierarchical clustering:

  1. Treat each observation as its own cluster.
  2. Calculate the similarity between every pair of clusters.
  3. Merge the two most similar clusters according to the linkage criterion.
  4. Update the dendrogram, which tracks the sequence of merges.
  5. Repeat until all observations have been merged into a single cluster.
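The steps above can be sketched with SciPy's hierarchical-clustering routines (illustrative data; average linkage is one of several possible criteria):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Three well-separated blobs in two dimensions (illustrative data)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0.0, 3.0, 6.0)])

# Steps 1-5: linkage() starts from singleton clusters, repeatedly merges the
# most similar pair under the chosen criterion, and records the dendrogram
Z = linkage(X, method="average")

# Cut the dendrogram to recover three clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(len(set(labels)))
```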

Example

Hypothesis Test of Means

 cluster  emmean     SE df lower.CL upper.CL
 1        0.0522 0.0737 97   -0.094    0.198
 2        0.7704 0.0796 97    0.612    0.928
 3       -0.7147 0.0737 97   -0.861   -0.568

Confidence level used: 0.95 
 contrast            estimate    SE df t.ratio p.value
 cluster1 - cluster2   -0.718 0.108 97  -6.622  <.0001
 cluster1 - cluster3    0.767 0.104 97   7.359  <.0001
 cluster2 - cluster3    1.485 0.108 97  13.692  <.0001

P value adjustment: tukey method for comparing a family of 3 estimates 

But!

\[ X_1 \sim N(0,1), \quad X_2 \sim N(0,1), \qquad X_1 \perp X_2\]

The data were generated as pure, independent noise: there are no true clusters, yet every pairwise comparison above is highly significant.

“Double Dipping”

Using the same data to generate a hypothesis and test that hypothesis.

e.g., cluster the data and then test the difference in means between those clusters.
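A small simulation makes the inflation concrete (a sketch assuming NumPy and SciPy; the pure-noise data, Ward linkage, and two-sample t-test are illustrative choices): even though no clusters exist, testing the difference in means of the estimated clusters rejects \(H_0\) far more often than 5% of the time:

```python
import numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
alpha, n_sim = 0.05, 200

rejections = 0
for _ in range(n_sim):
    x = rng.normal(size=(50, 1))    # pure noise: no true clusters
    labels = fcluster(linkage(x, method="ward"), t=2, criterion="maxclust")
    g1, g2 = x[labels == 1, 0], x[labels == 2, 0]
    _, p = stats.ttest_ind(g1, g2)  # 'double dip' on the same data
    rejections += p < alpha

rate = rejections / n_sim
print(rate)  # far above the nominal 0.05
```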

Sample Splitting

Sample Splitting

 cluster  emmean     SE df lower.CL upper.CL
 1       -0.0327 0.0947 47   -0.223    0.158
 2        0.7462 0.1114 47    0.522    0.970
 3       -0.8094 0.0922 47   -0.995   -0.624

Confidence level used: 0.95 
 contrast            estimate    SE df t.ratio p.value
 cluster1 - cluster2   -0.779 0.146 47  -5.326  <.0001
 cluster1 - cluster3    0.777 0.132 47   5.877  <.0001
 cluster2 - cluster3    1.556 0.145 47  10.756  <.0001

P value adjustment: tukey method for comparing a family of 3 estimates 

There is still leakage of information: the held-out observations must be assigned to clusters that were themselves estimated from data, so the tested hypothesis is still suggested by the data.
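A sketch of why splitting is not enough (assuming NumPy and SciPy; the nearest-centroid assignment rule is an illustrative choice): clusters are estimated on one half, the held-out half is assigned to those clusters, and the test on the held-out half is still driven by the clustering:

```python
import numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
x = rng.normal(size=(100, 1))  # pure noise again: no true clusters
a, b = x[:50], x[50:]          # split the sample in half

# Estimate two clusters on half A and compute their centroids
labels_a = fcluster(linkage(a, method="ward"), t=2, criterion="maxclust")
centroids = np.array([a[labels_a == k].mean() for k in (1, 2)])

# Assign each held-out point in B to the nearest centroid learned on A
labels_b = np.argmin(np.abs(b - centroids), axis=1)

# Testing on B still compares noise split at a data-driven threshold
_, p = stats.ttest_ind(b[labels_b == 0, 0], b[labels_b == 1, 0])
print(p)  # still significant despite the split
```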

Selective Inference

We have a matrix normal distribution:

\[ \boldsymbol{X} \sim \mathcal{MN}_{n \times q}(\boldsymbol\mu, \boldsymbol{I}_n, \sigma^2\boldsymbol{I}_q)\]

where the rows of \(\boldsymbol\mu\) are the mean vectors \(\mu_i \in \mathbb{R}^q\).

After applying clustering to the observed data \(\boldsymbol{x}\), we get the clusters \(\{\hat{\mathcal{C}}_1, \dots, \hat{\mathcal{C}}_K\}\).

The mean of any cluster \(\hat{\mathcal{C}}_k\) is:

\[\bar\mu_{\mathcal{\hat{C}_k}} = \frac{1}{|\mathcal{\hat{C}}_k|} \sum_{i \in \mathcal{\hat{C}}_k} \mu_i\]
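In code, the empirical analogue \(\bar{x}_{\hat{\mathcal{C}}_k}\) is just a row-wise mean over the observations assigned to the cluster (toy numbers, assuming NumPy):

```python
import numpy as np

X = np.arange(12.0).reshape(4, 3)  # n = 4 observations, q = 3 features
cluster = np.array([0, 1, 0, 1])   # illustrative cluster assignments

# Empirical mean of cluster 0: average the rows belonging to it
xbar_0 = X[cluster == 0].mean(axis=0)
print(xbar_0.tolist())  # [3.0, 4.0, 5.0]
```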

We want to test

\[H_0: \bar\mu_{\mathcal{\hat{C}_{k}}} = \bar\mu_{\mathcal{\hat{C}_{k'}}}\]

Hypothesis Test

Wald Test

\[p = \mathbb{P}\left( \chi^2_q \ \ge \ \kappa \| \bar{x}_{\hat{\mathcal{C}}_k} - \bar{x}_{\hat{\mathcal{C}}_{k'}} \|_2^2 \right), \qquad \kappa = \left( \sigma^2 \left( \tfrac{1}{|\hat{\mathcal{C}}_k|} + \tfrac{1}{|\hat{\mathcal{C}}_{k'}|} \right) \right)^{-1}\]
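A sketch of this naive Wald computation (assuming SciPy; the cluster means, cluster sizes, and known \(\sigma\) are made up, and \(\kappa\) is taken to be the variance scaling \((\sigma^2(1/|\hat{\mathcal{C}}_k| + 1/|\hat{\mathcal{C}}_{k'}|))^{-1}\)):

```python
import numpy as np
from scipy.stats import chi2

sigma, q = 1.0, 2                # known noise level, feature dimension
xbar_k = np.array([0.9, -0.2])   # illustrative cluster means...
xbar_kp = np.array([-0.6, 0.4])
n_k, n_kp = 20, 25               # ...and cluster sizes

kappa = 1.0 / (sigma**2 * (1 / n_k + 1 / n_kp))
stat = kappa * np.sum((xbar_k - xbar_kp) ** 2)
p = chi2.sf(stat, df=q)          # naive Wald p-value: ignores selection
print(stat, p)
```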

Corrected Test

\[p = \mathbb{P}_{H_0}\left( \| \bar{X}_{\hat{\mathcal{C}}_k} - \bar{X}_{\hat{\mathcal{C}}_{k'}} \|_2 \ \ge \ \| \bar{x}_{\hat{\mathcal{C}}_k} - \bar{x}_{\hat{\mathcal{C}}_{k'}} \|_2 \mid \textrm{Clustering results in }\hat{\mathcal{C}}_k, \hat{\mathcal{C}}_{k'} \right)\]

New p-value:

Of all the datasets that result in clusters \(\hat{\mathcal{C}}_k\) and \(\hat{\mathcal{C}}_{k'}\), what is the probability, assuming no difference in means, that we would see such a large difference between the sample means \(\bar{x}_{\hat{\mathcal{C}}_k}\) and \(\bar{x}_{\hat{\mathcal{C}}_{k'}}\)?
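The conditioning idea can be seen in a one-dimensional toy problem (not the paper's clustering test, just an analogue, assuming SciPy): suppose we only decide to test \(H_0: \mu = 0\) for \(X \sim N(\mu, 1)\) after observing \(x > c\). The selective p-value restricts the reference distribution to the selection event:

```python
from scipy import stats

c = 1.0      # we only form the hypothesis when x exceeds this threshold
x_obs = 2.2  # the observed value

# Naive p-value ignores that the hypothesis was suggested by the data
p_naive = stats.norm.sf(x_obs)

# Selective p-value: P(X >= x_obs | X > c) under H0
p_selective = stats.norm.sf(x_obs) / stats.norm.sf(c)

print(p_naive, p_selective)  # the corrected p-value is larger
```

The same logic drives the corrected clustering test: the reference distribution is truncated to the datasets for which the algorithm would have produced \(\hat{\mathcal{C}}_k\) and \(\hat{\mathcal{C}}_{k'}\).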

Example: Palmer Penguins

Cross-tabulation of species against the six estimated clusters:

  Species \ Cluster  1  2  3  4  5  6
  Adelie            60 12  1  0  0  0
  Chinstrap          4  2  0  0 27  1
  Gentoo             0  0  0 57  1  0

Example

Cluster 1 vs Cluster 2

$stat
[1] 10.41961

$pval
[1] 0.8667368

$trunc
Object of class Intervals
2 intervals over R:
(10.3122059802435, 16.3904453676166)
(131.758884280402, Inf)

Cluster 4 vs Cluster 5

$stat
[1] 18.86523

$pval
[1] 0.0004528178

$trunc
Object of class Intervals
2 intervals over R:
(16.8300111004978, 21.6868651391918)
(63.4548051430845, Inf)

Takeaways

  • We’re told to form hypotheses before looking at the data, but that is not what happens in practice.

  • Double dipping inflates the Type I error rate.

  • Data splitting does not mitigate double dipping and is wasteful.

  • Instead, condition on the fact that the hypothesis was generated from the data.

  • Multiple testing is not always easy to detect.

  • The same issue appears in classification and regression trees and in changepoint detection.